Pull request overview
This PR adds a new skill for retrieving and analyzing test failures from Azure DevOps builds and Helix test runs in the dotnet/runtime CI pipeline.
Changes:
- Introduces documentation and tooling to help investigate CI test failures
- Provides PowerShell script to query Azure DevOps and Helix APIs for failure information
- Enables querying by build ID or PR number with optional detailed log fetching
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/skills/azdo-helix-failures/SKILL.md | Documents the skill's purpose, usage examples, manual investigation steps, and common failure patterns |
| .github/skills/azdo-helix-failures/Get-HelixFailures.ps1 | PowerShell script that queries Azure DevOps for failed jobs and retrieves Helix console logs |
- Fix unnecessary backtick escaping in string interpolation
- Rename $matches to $urlMatches/$failureMatches to avoid shadowing automatic variable
- Add gh CLI dependency check with helpful error message
- Add -TimeoutSec parameter (default 30s) for API calls
- Add -MaxFailureLines parameter (default 50) for configurable output
- Improve Format-TestFailure to detect end of stack trace via empty lines
- Add Write-Verbose output for debugging
- Update SKILL.md with new parameters, prerequisites, and org/project documentation
- Add Extract-BuildErrors function to parse build logs for error patterns
- Add Get-FailureClassification function with known patterns:
  - macOS clang module cache/dsymutil issues
  - NativeAOT size regressions
  - NuGet package errors
  - Device infrastructure issues
  - Helix timeouts
  - C# and MSBuild compilation errors
- Expand Format-TestFailure patterns for better Helix log extraction
- For non-Helix failures, now extracts actual errors and provides classification, suggested action, and transient failure detection
- Add -Repository parameter to support repos other than dotnet/runtime
- Add -ContextLines parameter for error context
- Reorder error patterns (specific before general) to avoid overmatch
- Fix Select-Object ordering (First then Unique)
- Add classification to Helix test failures, not just build failures
- Expand Format-TestFailure to capture multiple failures (up to 3)
- Add new failure patterns:
  - OutOfMemoryException (transient)
  - StackOverflowException
  - Assertion failures
  - Test timeouts
  - Network connectivity issues
- Use ${LogId} syntax to prevent PowerShell parsing $LogId? as ternary
- Normalize line breaks in log content before extracting Helix URLs
- Update URL pattern to handle workitem names with special chars
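A minimal sketch of the `${LogId}` fix listed above, assuming a hypothetical log path and `api-version` query string (neither is taken from the script). PowerShell permits `?` inside an unbraced variable name, so bracing the name marks exactly where it ends:

```powershell
# Hypothetical values for illustration; the real script builds its own URLs.
$LogId = 42

# Braces delimit the variable name, so the "?" that follows is literal text.
$logUrl = "logs/${LogId}?api-version=7.0"
```

The braced form is safe whenever a variable is immediately followed by a character that could be read as part of its name.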
It's looking like I should close #123863 in favor of yours :-)
Script improvements:
- Add -HelixJob parameter for direct Helix job queries
- Add -WorkItem parameter to query specific work items
- Add Get-HelixJobDetails, Get-HelixWorkItems, Get-HelixWorkItemDetails functions
- Show work item artifacts, machine name, duration, exit code
- List failed work items when querying a job without -WorkItem

Documentation improvements (from PR dotnet#123863):
- Add build definition IDs table (129, 133, 139)
- Add failure classification table with all patterns
- Add Helix API curl examples
- Add artifact download documentation
- Add environment variable extraction examples
- Add links to triaging guide, area owners, Helix swagger
- Document -HelixJob and -WorkItem parameters
- Add Extract-TestRunUrls function to parse 'Published Test Run' URLs
- Add Get-LocalTestFailures to detect non-Helix test failures
- Add classification for local xUnit test failures
- Update main flow to report local test failures with links
- Update SKILL.md with new documentation
- Add Get-AzDOTestResults function using az devops invoke
- Fetch actual failed test names when az CLI is available
- Show up to 10 failed test names with count
- Add Extract-HelixLogUrls function to parse Helix console log URLs
- Display work item names with direct log links for Helix failures
- Deduplicate URLs to avoid showing duplicates
All examples now use ./scripts/Get-HelixFailures.ps1 relative to skill directory.
- Add try/catch to cache cleanup with verbose logging on failure
- Add comment explaining allowed chars in search term regex
- Use TryParse for build ID parsing instead of direct cast
- Validate headSha format (40 hex chars) before using in API call
- Fix FileName property in manual-investigation.md (was Name)
- Fix dotnet/sdk description: uses both local and Helix tests
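The build ID and headSha hardening mentioned in the commit notes above could look roughly like the following. This is a sketch under assumptions: the function names `ConvertTo-BuildId` and `Test-CommitSha` are hypothetical and are not claimed to match the helpers in Get-HelixFailures.ps1.

```powershell
# Hypothetical helper: parse a build ID without throwing on bad input.
function ConvertTo-BuildId {
    param([string]$Text)
    $id = 0
    # TryParse returns $false instead of raising an exception the way a direct [int] cast would.
    if ([int]::TryParse($Text, [ref]$id)) {
        return $id
    }
    return $null
}

# Hypothetical helper: accept only a full 40-character hexadecimal commit SHA.
function Test-CommitSha {
    param([string]$Sha)
    # -match is case-insensitive by default; \z anchors at the true end of the string.
    return $Sha -match '^[0-9a-f]{40}\z'
}
```

Returning `$null` for an unparsable build ID lets the caller decide whether to treat the input as a PR number, report an error, or fall back to another lookup.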
I prefer to be able to iterate on the log files after they are downloaded.
I'm pushing my improvements into https://github.com/pavelsavara/runtime/tree/helix-failures_pavel
@jkotas @stephentoub @lewing @pavelsavara I think what we have in this skill is super useful and if everyone agrees, we should get this in. Iterating in the PR is good, but I suspect we'll be able to offer improvements until the cows come home. Follow-ups are cheap.
Think of the caching as a way to buffer the network calls, not preserve results. Longer durations caused a lot of issues with stale data due to the naive URL-based cache. If you want to preserve the info, tell the LLM that in the prompt. Similarly, you can ask for the output format you like in the prompt and/or combine this with another skill that structures things how you prefer.
Your script is not exposing full log files, so the LLM or any other script trying to integrate with this would have to guess the schema of your cache and lift it from there (and get busted 30 seconds later by the cache purge). Let's ignore my branch. On further reflection, I think it would be good to separate the download/graph/crawling ability of your script from the post-processing ability. I have my own such scripts, but I thought we could develop this as a team, without forcing (me) into a particular style of log processing. But as Steve said, this is very useful as it is and I don't want to block merging it.
It had problems running the script, gave up and tried to do it without that script "by bare hands".
Nice session (with my branch) https://gist.github.com/pavelsavara/af35aedd4d0c7c23c3e1a9971af73a8d
How do I fix that? It fails with file not found but the file |
copilot help config
copilot help environment
copilot help permissions
Investigation: Script Path References
I reviewed the skill's script path references against the Anthropic Agent Skills documentation.
Finding: The current paths using
The paths in SKILL.md and reference docs are all consistent and follow this pattern.
Regarding @jkotas' issue: Looking at the log, every PowerShell command failed with "File not found" - including basic commands like
Note: When using skills, agents may either:
How the agent resolves these paths depends on the Copilot CLI implementation. The skill documentation correctly documents paths relative to the skill directory as specified by the standard.
@jkotas I know it sounds odd, but try explaining to Copilot what it is doing wrong.
Checked other documentation as well.
- Enhance description to include 'when to use' keywords per spec requirement
- Remove misleading note about relative paths (agents resolve from skill root)
# Check if failures correlate with PR changes
$hasCorrelation = $false
foreach ($failure in $allFailuresForCorrelation) {
    $failureText = ($failure.Errors + $failure.HelixLogs + $failure.FailedTests) -join " "
The recommendation logic at the end of the script uses string concatenation to join failure properties that may contain arrays. At line 1973, when Errors, HelixLogs, or FailedTests are arrays, the join operation will work correctly. However, there's a potential issue: if any of these properties are null or not initialized properly earlier in the code, this could cause unexpected behavior.
Consider adding null checks or ensuring these properties are always initialized as empty arrays when creating the failure objects throughout the script (e.g., at lines 1698, 1800, 1866).
$failureText = ($failure.Errors + $failure.HelixLogs + $failure.FailedTests) -join " "
$allFailureParts = @($failure.Errors) + @($failure.HelixLogs) + @($failure.FailedTests)
$failureText = ($allFailureParts | Where-Object { $_ }) -join " "
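To illustrate why the suggested change wraps each property in `@()`, here is a small sketch using a made-up failure record (the property values are hypothetical). One property is a bare string and one is `$null`, which are the cases the review comment is concerned about:

```powershell
# Hypothetical failure record for illustration only.
$failure = [pscustomobject]@{
    Errors      = 'error CS1002'       # a single string, not an array
    HelixLogs   = $null                # never initialized
    FailedTests = @('TestA', 'TestB')  # a proper array
}

# @() forces every property into an array, so + always means array concatenation.
$allFailureParts = @($failure.Errors) + @($failure.HelixLogs) + @($failure.FailedTests)

# Where-Object drops the $null (and any empty string) before joining.
$failureText = ($allFailureParts | Where-Object { $_ }) -join ' '
```

Without the `@()` wrappers, a string on the left of `+` turns the operation into string concatenation, so the individual entries are no longer separate elements when `-join` runs.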
.github/skills/azdo-helix-failures/scripts/Get-HelixFailures.ps1
The problem was with permissions. I had to give Copilot extra permissions to make it work - thanks @pavelsavara for the tip. I am not sure whether I am comfortable doing that, given what I have seen it do when it went off track (creating Python scripts in random places, etc.), but that's a separate problem. I think I am going to stick to running it in the GitHub Copilot sandbox for now at least.
With the extra permissions, it produced:
This is an invalid conclusion, since this PR has not modified System.Private.CoreLib and thus is very unlikely to introduce invalid IL in CoreLib.
I have no problem with merging this if others find it helpful.
Try asking it to support its conclusions.
I find it very useful even though in some instances conclusions it draws can be varying degrees of off. When that happens, conversing with it allows course correction and will lead to a better place.
The script's heuristic-based recommendations may be incomplete. Instruct agents to review detailed findings and form their own analysis before presenting results to users.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
else {
    # No Helix tasks - this is a build failure, extract actual errors
Missing closing brace: The if ($task.log) block that starts at line 1785 is never closed. There should be a closing brace after line 1842 (which closes the if ($logContent) block) and before line 1843 (which closes the foreach loop). This missing brace causes the else block at line 1846 to be incorrectly positioned outside the try block, resulting in a PowerShell syntax error.
else {
    # No Helix tasks - this is a build failure, extract actual errors
else {
    # No Helix tasks - this is a build failure, extract actual errors
Summary
Adds an AI agent skill for analyzing Azure DevOps and Helix test failures across dotnet repositories. When asked to investigate CI failures, the skill teaches Copilot how to query APIs, extract failure details, and provide actionable recommendations.
Features
Core Functionality
Intelligent Analysis
-SearchMihuBot)
Smart Recommendations
Provides actionable guidance at the end of analysis:
Usage
Example Prompts
Structure
Related
Supersedes #123863 - this PR includes a PowerShell script for automation rather than just documentation, plus additional features like Build Analysis integration, known issue search, and smart retry recommendations.